Sequence Similarity and Support Vector Machine for Protein Secondary Structure
Abstract
The majority of human coding regions have been sequenced, and several genome sequencing projects have been completed. With the growth of large-scale sequencing data, efficient approaches to protein analysis are increasingly important. Protein function and structure are the foundations of drug design and protein-based products. However, it is difficult to predict protein function and three-dimensional structure directly from the protein sequence. Therefore, analyzing protein secondary structure is indispensable. Previous work has mostly focused on classifying residues into three secondary structure states: helix, strand, and coil. Secondary structure prediction is thus commonly treated as a classification problem. Many studies that apply machine learning methods to this problem ignore the local sequence and structure properties of proteins. This affects prediction accuracy, because a large number of proteins are homologous even though their sequences are only remotely related. In this thesis, we propose to use sequence similarity and Support Vector Machines (SVMs) to predict protein secondary structure. First, we encode the amino acid sequences (from the RS126 and CB513 datasets) and transform sequence segments into vectors for training. Second, we construct SVM classifiers that assign each residue of each sequence to one of the three secondary structure classes (H, E, or C). SVMs have been successfully applied to pattern recognition problems. They are learning systems that use a hypothesis space of linear functions in a high-dimensional feature space, trained with a learning algorithm from optimization theory that implements a learning bias derived from statistical learning theory. This makes them well suited to large-scale protein sequence data. We present the challenges of protein secondary structure prediction and compare different approaches.
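The first step described in the abstract, turning sequence segments into training vectors, can be sketched as follows. This is a minimal illustration and not the thesis's actual encoding: the one-hot (orthogonal) scheme, the window length, the padding flag, the function name `encode_windows`, and the toy sequence and labels are all assumptions introduced here.

```python
# Hypothetical sketch: sliding-window one-hot encoding of a protein sequence.
# The encoding scheme and window size are assumptions, not taken from the thesis.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
SLOT = len(AMINO_ACIDS) + 1  # 20 residue flags + 1 padding/unknown flag


def encode_windows(sequence, window=13):
    """Return one feature vector per residue, built from a centered window."""
    if window % 2 == 0:
        raise ValueError("window must be odd so it has a central residue")
    half = window // 2
    vectors = []
    for center in range(len(sequence)):
        vec = [0] * (window * SLOT)
        for offset in range(-half, half + 1):
            pos = center + offset
            slot_start = (offset + half) * SLOT
            if 0 <= pos < len(sequence) and sequence[pos] in AA_INDEX:
                # Known residue inside the sequence: set its one-hot flag.
                vec[slot_start + AA_INDEX[sequence[pos]]] = 1
            else:
                # Window runs past a terminus, or the residue is non-standard.
                vec[slot_start + SLOT - 1] = 1
        vectors.append(vec)
    return vectors


# Toy input: the sequence and the H/E/C labels are made up for illustration.
features = encode_windows("MKTAYIAKQR", window=5)
labels = list("CCHHHHHHCC")
```

Each residue yields a vector of length `window * 21`, paired with its H, E, or C label. For the second step, these pairs could be passed to any SVM implementation (for example `sklearn.svm.SVC` in scikit-learn, which handles the three classes internally); the abstract does not say which library or multi-class scheme was used.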
Similar resources
Protein Secondary Structure Prediction: a Literature Review with Focus on Machine Learning Approaches
DNA sequence, containing all genetic traits is not a functional entity. Instead, it transfers to protein sequences by transcription and translation processes. This protein sequence takes on a 3D structure later, which is a functional unit and can manage biological interactions using the information encoded in DNA. Every life process one can figure is undertaken by proteins with specific functio...
A machine learning information retrieval approach to protein fold recognition
MOTIVATION Recognizing proteins that have similar tertiary structure is the key step of template-based protein structure prediction methods. Traditionally, a variety of alignment methods are used to identify similar folds, based on sequence similarity and sequence-structure compatibility. Although these methods are complementary, their integration has not been thoroughly exploited. Statistical ...
Distinguishing enzymes from non-enzymes via support vector machine
With many proteins sequenced, the ability of predicting protein function from sequence is becoming more and more important. Currently, methods for inference of the protein functional annotation are mostly based on identifying a known function protein which is similar to the query protein. However, for the proteins that are dissimilar or only similar to the unknown proteins, these methods will l...
Current Methods for Protein Secondary-structure Prediction Based on Support Vector Machines
The desire to understand protein structure has produced many approaches over the last four decades since Blout et al. (1960) attempted to correlate the sequence information of amino acids with their structural elements (Casbon, 2002). Instead of costly and time-consuming experimental approaches, effective prediction methods have been developed continuously. With the help of growing databases an...
A Hybrid Method for Protein Secondary Structure Prediction
Protein secondary structure can be used to help determine the tertiary structure via the fold recognition. Predicting the secondary structure from the protein sequence has attracted the attention of many researchers. Support Vector Machine (SVM) is a new learning algorithm based on statistical learning theory that has been successfully applied to the protein secondary structure prediction probl...